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Abstract 


This  paper  reports  on  new  algorithms  for  computing  the 
Hough  transform  on  mesh  array  architectures.  The  mesh 
is  fine-grained,  consisting  of  an  N  x  N  array  of  processors, 
each  holding  a  single  pixel  of  the  image.  The  mesh  array 
operates  in  an  SIMD  mode.  Several  algorithms,  differing  in 
the  techniques  they  use,  their  asymptotic  complexity,  or  the 
architectural  resources  required,  are  presented  for  computing 
the  Hough  transform.  The  main  algorithm  computes  any  P 
angles  of  the  Hough  transform  in  0(N  +  P)  time  and  uses 
only  a  constant  amount  of  memory  per  processor.  All  the 
algorithms  apply  to  the  more  general  problem  of  computing 
the  Radon  transform  of  gray-level  images. 


Parallel  computing  for  image  processing  and  computer  vi¬ 
sion  has  recently  received  considerable  attention  [1-4].  Tech¬ 
nological  advances  have  made  possible  fine-grained  architec¬ 
tures  [5-7],  Mesh  connected  computers  are  of  particular  in¬ 
terest  to  the  image  community  [9-15].  Many  practical  al¬ 
gorithms  have  been  implemented  on  both  fine-grained  ar¬ 
rays  [16-17],  and  coarse-grained  meshes  [18].  Theoretical  ad¬ 
vances  for  mesh  computer  algorithms  have  also  been  reported 
[19-20],  although  it  will  take  some  time  before  these  ideas  can 
be  put  into  practice.  This  paper  addresses  the  problem  of 
calculating  the  Hough  Transform,  and  more  generally  the 
Radon  Transform,  on  a  mesh  array  computer.  The  fastest 
published  algorithm  for  this  problem  calculates  P  projections 
of  the  Hough  Transform  in  O(PN')  time  [8].  Algorithms  will 
be  presented  here  that  calculate  P  such  projections  in  0(P 
-t-  N)  time. 


A  mesh  array  computer  consists  of  an  N  x  N'  array  of  pro¬ 
cessing  elements  (PEs)  where  each  PE  consists  of  a  processor 
and  an  associated  memory.  The  PEs  operate  in  a  Single  In¬ 
struction  Stream,  Multiple  Data  Stream  (SIMD)  mode,  with 
all  control  signals  coming  trom  a  single  control  unit.  The 
control  unit  reads  instructions  from  its  private  memory,  de¬ 
codes  them,  and  broadcasts  the  control  signals  to  the  PE 
array,  in  addition  to  broadcasting  the  control  information 
to  the  processors,  the  control  unit  typically  sends  addresses 


to  the  memory  units,  so  every  PE  accesses  the  same  memory 
location  at  a  given  time.  Although  there  is  no  direct  data 
connection  from  the  controller  to  the  array,  there  are  situa¬ 
tions  when  a  number  must  be  broadcast  from  the  controller 
to  the  PEs.  This  is  accomplished  by  using  the  control  lines 
to  cause  the  PEs  "to  calculate”  the  number  one  bit  at  a  time. 

There  is  assumed  to  be  a  special  register  in  each  PE. 
called  a  mask  register,  which  when  an  instruction  is  sent  from 
the  controller,  only  those  PEs  with  a  1  in  their  mask  register 
perform  the  instruction;  all  others  do  nothing.  This  allows 
operations  to  be  performed  on  a  subset  of  the  PEs  in  a  data 
dependent  manner.  Of  course,  there  are  some  instructions 
that  operate  on  all  PEs  regardless  of  the  setting  of  the  mask 
registers,  thus  allowing  the  disabled  PEs  to  be  used  again. 

Two  architectures  will  be  used  here. 

The  first  architecture  studied  is  a  square  array1  of  N  x 
N'  PEs.  With  the  top  and  bottom  rows  connected  and  the 
leftmost  and  rightmost  columns  are  connected,  so  the  inter¬ 
connections  logically  from  a  torus.  The  input  is  an  .V  x  X 
image.  This  will  be  refereed  to  as  the  "plain  array”  architec¬ 
ture.  The  PEs  operate  in  a  SIMD  mode  under  the  direction 
of  a  single  controller  for  the  entire  array.  Each  processor  has 
its  own  local  RAM.  and  unless  stated  otherwise,  it  consists  of 
a  constant  number  of  words  of  memory,  where  each  word  has 
0(log  N)  bits.  Processors  communicate  directly  with  their  4 
closest  neighbors  one  neighbor  at  a  time.  It  is  assumed  that 
each  processor  has  a  mash  register. 

The  second  architecture  is  identical  to  the  first  except 
that  in  addition  to  the  square  array  of  PEs.  it  includes  a  tree 
of  PEs  per  row  of  the  array.  The  PE  trees  are  built  on  top  of 
the  array,  so  that  the  row  PEs  form  the  ieaf  nodes  of  a  tree 
The  root  of  each  tree  has  an  output  port  for  the  removal  of 
results  from  the  tree.  A  second  controller  is  provided  so  that 
the  PEs  internal  to  the  tree  can  Derform  one  operation  whi'° 
the  PEs  in  the  array  perform  another.  In  this  way.  additional 
parallelism  is  achieved.  This  architecture  will  be  referred  to 
as  the  "augmented  array”  architecture.  For  the  algorithms 
presented  in  the  next  section,  it  is  sufficient  that  the  PEs  in 
the  nonleaf  levels  of  the  trees  be  adders.  These  trees  can  be 
shown  to  be  useful  in  solving  a  number  of  important  vision 
problems  [21]. 

'Notice  that  arrays  of  processors  and  pixels  are  indexed  such  that 
(0.0)  is  in  the  lower  left  corner. 
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Algorithm  1 


The  time  analysis  of  the  algorithms  will  measure  simple 
operations  or.  1  word  quantities,  where  each  word  has  0(log 
X)  bits.  Such  a  word  size  is  sufficient  to  represent  values  the 
quantities  computed.  Although  operations  on  words  that  are 
longer  than  1  bit  are  not  directly  supported  by  the  bit-serial 
machines  currently  in  existence,  they  can  be  implemented 
in  software  as  a  sequence  of  bit-serial  operations.  The  time 
analysis  will  be  for  the  worst  case  data. 

Hough  and  Radon  Transforms 

The  Radon  Transform  of  a  gray-level  image  is  a  set  of 
projections  of  the  image  taken  from  different  angles.  Specif¬ 
ically.  the  image  is  integrated  along  line  contours  defined  by 
the  equation: 

{(x.y)  |  xcos{9)  +  ysm(8)  =  p  } 

where  9  and  p  are  the  orientation  and  the  offset  of  the  line, 
respectively.  It  may  be  assumed  that  0  <  8  <  ir.  because 
projections  are  the  same  for  8  and  for  8  +  ir  regardless  of 
the  value  of  8.  The  original  definition  of  the  Radon  trans¬ 
form  involves  a  line  integral,  but  for  digitized  pictures,  this 
integral  is  replaced  by  a  weighted  summation.  It  will  be  as¬ 
sumed  that  each  line  contour  is  approximated  by  a  1  pixel 
wide  band.  All  pixels  centered  in  a  given  band  contribute  to 
the  value  of  that  band.  Because  each  band  is  1  pixel  wide, 
there  are  at  most  >/2  N  values  of  p  for  each  value  of  8.  The 
parameter  P  will  be  used  to  denote  the  number  of  projections 
(values  of  9)  to  be  calculated.  The  computational  aspects  of 
the  discrete  form  of  the  Radon  transform  and  its  implemen¬ 
tation  in  pipeline  architectures  have  been  considered  [22]. 

The  Hough  transform,  often  used  to  locate  the  edges  of 
objects  in  an  image,  is  simply  the  case  of  the  Radon  trans¬ 
form  where  the  input  image  is  binary.  It  has  been  imple¬ 
mented  on  different  processors  such  as  the  GAPP  [8],  systolic 
arrays  [23],  and  other  pipeline  architectures  [24].  Special 
architectures  for  computing  peaks  of  the  Hough  transform 
have  been  proposed  [25]  and  built  [26],  In  addition,  Kushner 
et  al.  [27]  studied  the  problem  of  implementing  the  Hough 
Transform  on  the  MPP  and  concluded  that  because  the  pro¬ 
jections  are  at  various  orientations,  the  problem  "is  of  a  form 
that  the  fixed  geometry  of  the  MPP  cannot  easily  handle." 
There  follows  five  new  algorithms  for  the  Radon  (Hough) 
Transform  on  mesh  arrays  having  different  time  complexi¬ 
ties  and  resource  requirements. 

One  way  to  calculate  either  Transform  on  a  plain  array  is 
,A  rota'e  the  columns  of  the  image  until  all  of  the  pixels  that 
lie  in  the  same  band  (line  contour)  for  a  given  projection 
angle  are  in  the  same  row.  The  projection  is  then  calculated 
by  totaling  the  values  for  each  band  by  using  horizontal  shifts 
and  adds.  The  first  four  algorithms  are  based  on  this  general 
approach.  The  first  algorithm  presented  here  is  faster  by  a 
constant  factor  than  the  best  publish  algorithm  [S|  and  the 
remaining  algorithms  are  asymptotically  faster. 


The  first  algorithm  operates  on  a  plain  array  with  a  eon 
stant  number  of  words  of  memory  per  PE.  It  requires  Oi  PN  i 
time  to  calculate  P  projections  of  the  Transform.  The  rase 
where  j  <  9  <  ^  will  be  examined  first.  The  values  of  8  are 
treated  in  order  starting  with  j  and  increasing  through  -X- 
For  each  value  of  9.  the  controller  first  broadcasts  the  values 
sin(0)  and  cos (9)  to  all  PEs.  Each  PE  calculates 

d  =  x  cost 8)  +  v  sin(0). 

where  x  and  y  are  the  row  and  column  coordinates  of  the 
pixel  in  the  PE.  The  value  d  is  the  distance  from  the  origin 
to  the  line  passing  through  the  point  I  x.  v )  and  perpendicular 
to  8.  Xext  each  PE  calculates 

p  =  floor!  d) 

the  distance  from  the  origin  to  the  1  pixel  wide  band  passing 
through  ( x.y)  at  the  desired  angle.  Then  each  PE  calculates 

v  =  ceiling(p/sin(0)) 

the  smallest  value  on  the  y-axis  within  the  band  containing 
the  point  (x.y).  Each  PE  then  creates  a  record  containing 
its  pixel  value  and  the  variables  x,  y  and  v.  These  records 
are  shifted  downward  (cyclically,  because  of  the  connections 
between  the  top  and  bottom  rows)  until  each  record  is  in 
row  r  where  r  =  v  mod  N.  iSee  Figure  2.)  For  instance,  if 
the  record  originally  in  PE  (3,  10)  has  a  v  value  of  1,  the 
record  is  shifted  down  2  rows.  After  this  downward  shifting, 
each  row  r  will  have  the  pixel  values  of  all  points  in  bands 
that  cross  the  y-axis  at  distance  v  from  the  origin  where  v 
=  r  mod  X. 

Because  J  <  9  <  2f.  there  are  at  most  two  bands  with 
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different  values  of  v  on  the  same  row  r.  Also  at  most  two 
pixels  in  the  same  column  of  the  image  lie  within  the  same 
band  because  each  band  is  1  pixel  wide.  To  avoid  shifting 
one  record  on  top  of  another,  the  downward  shifts  should  be 
performed  in  two  stages,  one  for  pixels  with  even  y  coordi¬ 
nates  and  one  for  odd  y  coordinates.  Once  the  records  have 
been  shifted  downward  to  the  proper  row  r  =  v  mod  X.  they 
are  stored  for  future  use  in  calculating  the  next  projection. 
Then,  the  pixel  values  'or  records  with  odd  y  coordinates  are 
added  to  those  with  even  y  coordinates.  Xext.  the  total  of 
each  band  is  calculated  by  shifting  the  pixels  left  one  col¬ 
umn  and  adding,  then  shifting  them  left  two  columns  and 
adding,  then  shifting  them  left  four  columns  and  adding, 
etc.  Because  each  row  may  have  the  pixels  for  two  bands 
within  it.  this  summation  of  pixels  across  the  rows  must  be 
performed  in  two  nasses.  one  for  each  band.  The  totals  ac¬ 
cumulated  in  the  first  column  form  the  desired  projection 
The  next  projection  is  calculated  in  the  same  manner,  start¬ 
ing  with  the  records  in  the  PEs  where  they  were  stored  after 
t-  rfc.r.ing  the  vertical  shifts  in  the  calculation  of  the  pre'n 
ous  projection.  By  performing  the  projections  in  order  and 
storing  the  records  after  down  shifting  each  projection,  the 
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later  projections  can  take  advantage  of  the  shifts  performed 
for  the  earlier  projections.  As  a  result,  the  total  number  of 
downward  shifts  is  reduced.  In  fact,  it  can  be  shown  that 
only  0(P  +  N)  downward  shifts  are  required  for  all  of  the 
projection  angles  in  the  range  §  <  9  <  gp.  The  projections 
for  ^  <  9  <  |  are  calculated  in  the  same  way  starting  with 
the  original  unshifted  image  and  processing  the  9  values  in 
decreasing  order. 

There  is  a  problem  with  using  this  same  algorithm  to  cal¬ 
culate  the  projections  for  0  <  9  <  j  and  np  <  9  <  x.  These 
projections  use  bands  that  are  more  vertical  than  horizontal, 
so  many  bands  with  different  values  of  v  will  have  the. same 
value  of  r.  Because  these  bands  would  have  to  be  totaled 
sequentially,  the  total  time  for  the  algorithm  would  increase. 
One  solution  is  to  modify  the  algorithm  so  that  role  of  rows 
and  columns  is  reversed.  Then  each  band  would  be  shifted 
sideways  (instead  of  down)  until  each  occupies  a  single  col¬ 
umn  of  PEs;  totaling  the  bands  would  be  performed  within 
columns.  This  technique  would  guarantee  that  no  more  than 
2  bands  would  occupy  the  same  column  just  as  no  more  than 

2  bands  could  occupy  the  same  row  when  j  <  9  < 

A  different  technique  is  presented  here,  since  it  can  be 
easily  modified  to  create  an  algorithm  for  an  augmented  ar¬ 
ray.  This  algorithm  calculates  the  transpose  of  the  image 
before  calculating  the  projections  for  0  <  9  <  J  or  gp  <  6  < 
it.  To  obtain  the  transpose,  the  original  image  is  shifted  in 

3  stages:  In  the  first  stage,  the  pixel  in  PE  (x.v)  is  shifted 
down  x  times.  In  the  second  stage,  the  pixel  in  P(x.v)  after 
the  first  stage  is  shifted  to  the  right  y  times.  In  the  third 
stage,  the  pixel  in  PE  (x.y)  after  the  second  stage  is  shifted 
down  x  +  L  times.  The  result  is  actually  the  reflection  of 
the  transpose  about  a  vertical  axis,  but  this  is  sufficient  to 
guarantee  that  no  more  than  2  bands  are  ever  sent  to  the 


same  row. 


Algorithm  2 


In  the  Transform  algorithm  presented  above,  there  are 
0(P  +  N)  downward  shifts  and  O(PN)  horizontal  shifts. 
The  second  algorithm  is  identical  except  that  it  uses  the 
augmented  array  to  perform  the  horizontal  shifts  and  adds. 
The  augmented  array  can  calculate  totals  for  all  rows  in  par¬ 
allel  in  O(log  N)  time.  .More  importantly,  the  row  totals  for 
successive  projections  can  be  pipelined,  so  that  P  projections 
can  be  calculated  .n  O(P-l-N)  time.  As  soon  as  the  totals  for 
one  projection  angle  start  up  the  PE  trees,  the  vertical  shifts 
for  the  next  projection  angle  are  started.  This  overlap  of  hor 
izontal  totaling  and  vertical  shifting  is  possible  because  there 
are  two  controllers  present  in  the  augmented  array.  Again, 
it  is  important  that  the  (reflection  about  a  vertical  axis  of 
thel  transpose  is  used  so  that  all  totalling  occurs  along  rows 
and  can  thus  use  the  trees  of  PEs.  As  in  the  first  algorithm, 
the  second  algorithm  uses  only  a  constant  number  of  words 
of  memory  per  PF. 

Algorithm  3 

The  third  algorithm  can  perform  the  Hough  Transform 
on  a  certain  type  of  plain  array  machine  in  O(P-t-N)  time. 


The  required  plain  array  machine  must  have  two  leatures 
not  needed  for  the  O(PN')  time  algorithm.  First,  each  PE 
must  have  Of P )  words  of  memory.  Second,  the  PEs  must 
have  the  ability  to  perform  independent  addressing,  that  is. 
rather  than  ail  PE's  referencing  the  same  data  address,  they 
can  reference  different  addresses.  Independent  addressing 
can  be  implemented  in  SIMD  execution  mode  using  indirec¬ 
tion.  The  algorithm  uses  the  same  techniques  of  performing 
vertical  shifts  to  place  the  pixels  from  each  band  into  the 
same  row  of  PEs.  and  horizontal  shifts  to  total  the  pixels  in 
each  band.  Whereas  the  first  algorithm  alternated  between 
performing  the  vertical  shifts  and  totaling  the  bands  and  the 
second  algorithm  overlaps  the  totaling  for  the  bands  of  one 
projection  with  the  vertical  shifts  for  the  next  projection, 
the  third  algorithm  performs  the  vertical  shifts  for  all  of  the 
projections  before  calculating  any  of  the  totals  for  the  bands. 
First  the  vertical  shifts  for  the  projections  in  the  range  \  < 

9  <  are  performed  in  order:  then  the  vertical  shifts  for 
the  projections  j  <  9  j  are  performed.  Next,  the  transpose 
of  the  image  (reflected  about  the  vertical  axis)  is  calculated. 
Then  the  vertical  shifts  for  0  <  9  <  J  and  for  *>.  9  <  x  are 
performed.  Each  PE  uses  a  variable  called  the  “buffer  array” 
to  hold  the  2P  entries.  After  the  vertical  shifts  for  the  i-th 
projection  angle  are  performed,  the  shifted  pixels  are  stored 
in  the  buffer  array  either  in  location  i  or  in  location  i  4-  P. 
Specifically,  the  pixels  that  have  nonnegative  v  values  (refer 
to  algorithm  1  for  the  definition  of  v  values)  are  stored  in 
location  i  and  the  pixels  with  negative  v  values  are  stored  in 
location  i  +  P.  The  buffer  array  is  initialized  to  all  Os  before 
the  pixels  are  stored  in  it.  so  any  location  not  containing  a 
pixel  contains  a  0.  The  total  time  for  the  algorithm  to  this 
point  is  0(P+N). 

Each  PE  contains  a  buffer  array  with  2P  entries.  The 
entries  corresponding  to  a  single  band  of  a  projection  all  lie 
in  the  same  row  of  PEs  and  they  are  all  at  the  same  offset 
within  the  buffer  arrays.  The  totals  for  the  bands  are  now- 
calculated  in  0(P  +  N)  time.  Each  PE  has  a  variable,  called 
band.total.  that  is  initialized  to  0.  To  calculate  the  band 
totals,  it  begins  by  adding  the  contents  of  buffer  location  i 
to  the  band.total.  where  i  is  the  column  number  of  the  PE 
holding  the  buffer.  (It  is  this  operation  that  necessitates  the 
independent  addressing.)  Next,  it  shifts  the  band.totals  one 
column  to  the  right,  and  adds  the  contents  of  buffer  location 
(i-l)  mod  N  to  the  band.totals.  Again  the  band.totals  are 
shifted  one  column  to  the  right,  and  added  to  the  contents  of 
buffer  location  (i-2)  mod  N.  This  procedure  of  shifting  ami 
adding  is  repeated  X  times.  The  effect  ot  these  operations  is 
shown  in  Figure  3.  If  2P  <  X,  then  some  PEs  will  attempt 
to  access  buffer  locations  that  do  not  exist:  such  PEs  are 
disabled  while  the  buffer  is  being  accessed.  Once  the  N  hori¬ 
zontal  shifts  and  adds  have  been  performed,  each  band.total 
variable  contains  the  total  for  a  different  band.  Assuming 
that  the  leftmost  column  of  PE*  is  connected  to  the  array's 
I/O  ports,  the  band.total  variables  can  be  shifted  to  the  left 
N  times  to  remove  the  result  from  the  array.  This  totaling 
procedure  uses  O(N)  time.  If  2P  <  N.  the  Transform  is  com¬ 
plete.  If  2P  >  N.  then  set  the  band.total  variables  to  0  again 
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and  repeat  the  above  procedure  an  additional  ceiling  ( 2 P / N ) - 
1  times.  In  this  case,  the  totaling  requires  0( P)  time.  Either 
wav,  the  entire  algorithm  requires  0(P+N)  time. 

Algorithm  4 

The  fourth  algorithm  uses  a  plain  array  with  no  indepen¬ 
dent  addressing  and  0(P)  words  of  memory  per  PE  to  calcu¬ 
late  P  projections  of  the  Transform  in  0(P  log  P  +  N)  time, 
assuming  that  2P  <  N'.  Again,  each  PE  has  a  buffer  array 
with  2P  entries,  initialized  to  0.  The  technique  as  was  used 
in  the  third  algorithm  is  used  to  store  the  vertically  shifted 
pixels  into  the  buffer  arrays  in  0(P  +  N)  time.  The  fourth 
algorithm  differs,  however,  in  the  way  the  bands  are  totaled. 
In  the  third  algorithm,  independent  addressing  was  used  so 
that  the  S  band  .total  variables  in  each  row  accessed  different 
locations  in  the  buffer  arrays.  In  the  fourth  algorithm,  with 
no  independent  addressing  available,  the  contents  of  each 
buffer  array  is  rotated  downward  i  mod  (2P)  positions,  where 
i  is  the  number  of  the  column  of  the  PE  holding  the  array. 
These  downward  rotations  are  cyclic,  so  that  the  contents 
of  buffer  array  location  j  is  moved  to  buffer  array  location 
(i  +  j)  mod  (2P).  The  technique  to  accomplish  these  down¬ 
ward  shifts  will  be  explained  shortly.  Once  the  downward 
rotations  are  completed,  the  band.total  variables  are  used 
to  calculate  the  totals  for  the  bands:  First  they  are  initial¬ 
ized  to  the  contents  of  location  1  in  the  buffer  arrays.  Then 
they  are  shifted  one  more  column  to  the  right  and  added  to 
the  contents  of  location  2  in  the  buffer  arrays.  This  shift¬ 
ing  and  adding  procedure  is  repeated  until  each  band-total 
is  the  sum  of  N  entries.  (This  band  totaling  procedure  is 
illustrated  in  Figure  4.)  The  results  are  then  shifted  out  of 
the  array  with  N  left  shifts.  Because  it  was  assumed  that 
2P  <  N,  this  is  the  desired  result.  The  horizontal  shifts  and 
additions  require  O(N)  time. 

One  way  to  perform  the  downward  rotations  of  the  buffer 
arrays  would  be  to  process  in  parallel  only  those  columns 
that  are  being  rotated  the  same  amount:  all  other  columns 
would  be  disabled.  The  contents  of  each  buffer  array  in  the 
enabled  columns  would  be  copied  to  a  temporary  array  in 
the  same  PE.  Then  the  contents  of  this  temporary  array 
would  be  copied  to  the  buffer  array  in  the  rotated  position. 
To  rotate  by  a  single  amount  requires  O(P)  time.  Because 
there  are  O(P)  different  amounts  to  be  rotated,  the  entire 

rotation  procedure  would  require  0(PJ)  time. 

A  better  technique  is  based  on  the  idea  of  a  barrel  shifter; 
it  performs  the  rotation  in  0(P  log  P)  time.  Only  shifts  that 
are  powers  of  2  are  performed.  First,  some  buffer  arrays  are 
shifted  downward  1  position,  then  some  are  shifted  down¬ 
ward  2  positions,  then  some  are  shifted  downward  4  posi¬ 
tions,  etc.  The  PEs  that  are  enabled  during  a  given  shift  are 
chosen  according  to  the  following  rule:  Let  i  be  the  column 
number  of  the  PE.  and  let  k  =  i  mod  (2P)  be  the  number 
of  positions  that  itr  buffer  array  will  be  shifted.  Then  when 
a  downward  shift  of  2J  positions  is  being  performed,  only 
those  PEs  are  enabled  which  have  the  j-th  bit  of  k  equal  to 
1,  assuming  that  the  least  significant  bit  is  the  0th  bit.  A 
total  of  ceiling  (log  2P)  different  shifts  are  performed,  and 


each  shift  requires  O(P)  time,  so  all  of  the  downward  shifts 
can  be  performed  in  0(P  log  P)  time. 

In  the  above  analysis,  it  was  assumed  that  2P  <  N.  and  so 
a  total  of  0(P  log  P  -f  N)  time  is  required.  If  2P  >  N.  then 
the  entire  procedure  given  above  can  be  used  to  calculate 
the  first  N/2  projections  in  0( N  log  N)  time,  the  next  N/2 
projections  in  an  additional  0(N  log  N)  time.  etc.  A  total  of 
ceiling  (2P/N)  iterations  are  required,  yielding  a  total  time 
of  0(P  log  S). 

Algorithm  5 

Finally,  the  fifth  algorithm  uses  a  plain  array,  no  inde¬ 
pendent  addressing  and  only  a  constant  number  of  words  of 
memory  per  PE  to  calculate  P  projections  of  a  Transform 
in  0(P+N)  time.  The  technique  used  differs  significantly 
from  the  techniques  used  in  the  four  previous  algorithms. 
The  previous  algorithms  all  performed  vertical  shifts  on  the 
image  in  order  to  simplify  the  totaling  of  the  pixels  in  each 
band.  This  algorithm,  however,  does  not  perform  such  ver¬ 
tical  shifts.  Instead,  the  total  for  a  band  is  calculated  by 
having  a  variable,  called  “band-total".  visit  all  of  the  pixels 
in  its  assigned  band.  As  a  band.total  encounters  a  pixel  in 
its  band,  it  adds  the  pixel’s  gray  level  to  itself.  An  informal 
description  of  the  algorithm  is  given  next. 

To  understand  the  algorithm,  it  is  useful  to  examine  how 
a  single  band.total  variable  is  moved  across  the  image.  This 

band.total  is  assigned  a  particular  value  of  “v"  and  ~6 "  (refer 
to  algorithm  1  for  a  definition  of  v).  It  must  move  across  the 
image  in  such  a  manner  that  it  visits  all  of  the  pixels  (x.y) 
where 

floor  (x  cos(0)  +  y  sin(0))  =  v. 

The  set  of  pixels  that  must  be  visited  by  this  band.total  will 
be  referred  to  as  the  pixels  “owned"  by  the  band.total.  An 
example  of  the  set  of  pixels  owned  by  a  band.total  is  shown 
in  Figure  5. 

In  the  following  it  will  be  assumed  that  t  <  9  <  -y. 
though  the  same  techniques  apply  to  the  other  cases.  In 
particular,  when  j  <  0  <  y,  the  roles  of  up  and  down  shifts 
are  reversed.  The  remaining  cases  are  identical  except  that 
the  roles  of  the  rows  and  columns  are  reversed. 

Because  f  <  9  <  each  band.total  owns  at  most  2 
pixels  in  each  column.  One  way  the  band.total  could  visit 
these  pixels  is  to  visit  each  PE  that  contains  them.  This 
has  the  disadvantage  that  the  band.total  must  shift  verti¬ 
cally  to  access  both  of  the  pixels  it  owns  in  a  single  column 
The  algorithm  presented  here  avoids  these  vertical  shifts  by 
performing  a  single  downward  shift  of  the  image.  There  are 
thus  two  versions  of  the  image  in  each  PE.  the  original  ver¬ 
sion  and  the  down-shifted  version.  The  band.total  is  moved 
across  the  image,  visiting  the  PE  containing  the  lower  of  the 
band-total's  pixels  in  each  column. 

The  set  of  PEs  visited  by  the  band.total  in  Figure  5  is 
shown  in  Figure  6.  The  set  of  PEs  that  a  band.total  must 
visit  will  be  referred  to  as  the  PEs  that  are  “owned”  by  the 
band.total.  In  order  for  a  band.total  to  visit  all  the  PF.s 
it  owns,  it  is  placed  initially  in  the  leftmost  PE.  It  is  then 


shifted  one  column  to  the  right  and  then  optionally  shifted 
up  one  row  to  visit  the  next  PE  that  it  owns.  This  process 
of  shifting  to  the  right  and  optionally  shifting  up  is  repeated 
until  the  band.total  has  visited  all  of  its  PEs.  The  decision  as 
to  whether  or  not  the  band.total  should  be  shifted  upwards 
is  based  on  a  local  calculation  that  will  be  explained  below. 

Having  examined  the  behavior  of  a  single  band.total.  it  is 
now  possible  to  explain  how  the  band.totals  operate  in  par¬ 
allel.  Consider  the  calculation  of  a  single  projection  angle  9, 
still  assuming  that  §  <  9  <  This  projection  is  calculated 
by  placing  the  band.totals  in  the  first  column  of  PEs.  each 
in  the  PE  that  it  owns.  The  PE  owned  by  a  given  band-total 
sets  its  columnjcontrib  (column  contribution)  variable  to  the 
sum  of  the  gray  levels  of  the  pixels  owned  by  that  band.total 
in  the  column.  The  band.total  adds  this  columnjcontrib  to 
its  current  value.  The  band.totals  are  then  shifted  right 
to  the  second  column  and  some  are  shifted  up  so  that  all 
are  placed  in  the  PE  that  they  own  in  the  second  column. 
Again,  the  PEs  set  their  column-contrib  variables  and  these 
variables  are  added  to  the  band.totals.  This  procedure  is 
repeated  until  all  columns  have  been  visited.  At  any  one  • 
time,  the  band.totals  for  the  given  projection  angle  9  are  all 
located  in  the  same  column  of  PEs.  Note  that  collisions  be¬ 
tween  band.totals  are  impossu.-r  because  the  paths  of  two 
band.totals  with  the  same  value  of  9  can  never  cross. 

To  calculate  the  projection  for  a  given  angle  theta,  it 
is  necessary  to  visit  the  PEs  in  the  first  column,  the  PEs 
in  the  second  column,  etc.,  until  all  of  the  columns  have 
been  visited.  Thus,  the  PEs  that  are  used  in  calculating  one 
projection  of  the  Transform  form  a  wave  propagating  from 
the  left  to  the  right.  Accordingly,  it  is  possible  to  pipeline  the 
calculation  of  the  Transform  by  starting  the  calculation  for 
the  second  projection  angle  as  soon  as  the  nrst  has  finished 
using  the  PEs  in  the  first  column.  As  many  as  N  projections 
can  be  pipelined  in  this  manner  at  one  time.  Because  the 
calculation  of  a  single  projection  requires  O(N')  time,  and 
because  the  calculation  of  P  projections  can  be  pipelined, 
the  calculation  of  the  entire  Transform  can  be  performed  in 
0(N  •+•  P)  time. 

The  above  discussion  ignored  a  number  of  details  of  the 
algorithm.  In  particular,  it  did  not  specify  how  the  band.totals 
determine  when  to  shift  up.  how  the  columnjcontrib  vari¬ 
ables  are  calculated,  or  how  bands  that  do  not  include  pixels 
in  both  the  first  and  last  columns  are  handled.  These  issues 
are  addressed  next. 

When  the  calculation  of  a  new  projection  angle  9  is  started, 
the  values  cos(0)  and  sin(0)  are  broadcast  from  the  controller 
to  the  PEs  in  the  first  column.  These  values  are  shifted  right 
whenever  the  corresponding  band.totals  are  shifted  right.  In 

addition,  each  band.total  is  accompanied  by  a  band  .number 
identifying  which  band  it  is  following.  After  performing  a 
right  shift  of  the  band.total.  band_number.  sin(0)  and  cos|0). 
the  PEs  calculate 

d  =  x  cos(9)  +  v  sin(0)  and  p  =  floor(d). 

Then  each  PE  determines  if  it  is  the  lowest  PE  in  its  col¬ 


umn  with  its  value  of  p.  This  is  accomplished  by  shifting 
the  p  values  up  one  row.  Each  PE  compares  the  received  r 
value  with  its  own  p  value,  and  if  they  match,  sets  its  own 
p  value  to  infinity.  At  this  point  the  PEs  with  finite  p  val¬ 
ues  are  the  ones  owned  by  some  band.total  for  the  current 
projection  angle  9.  Next,  the  (possibly  infinite)  p  values  are 
shifted  down  one  row.  The  down-shifted  p  values  determine 
whether  or  not  there  are  two  pixels  in  the  given  column  that 
lie  in  the  same  band:  If  a  PE’s  own  p  value  is  finite  and  the 
p  value  that  is  received  from  the  PE  above  it  is  finite,  then 
the  PE  knows  that  it  contains  the  onlv  pixel  in  its  column 
lying  in  its  band.  From  this  information  each  PE  with  a 
finite  p  value  sets  its  column.contrib  variable  to  the  sum  of 
the  gray  levels  of  the  pixels  in  its  column  and  band.  Next, 
each  band.total  examines  the  p  value  of  the  PE  in  which 
it  is  located.  If  the  PE  has  a  finite  p  value  matching  the 
band_number.  then  the  band.total  is  in  the  correct  PE.  Oth¬ 
erwise.  the  band.total  and  band-number  are  shifted  up  one 
row.  The  column-contrib  is  then  added  to  the  band.total. 

So  far.  band.totals  have  only  been  created  for  bands  con¬ 
taining  pixels  in  the  leftmost  column.  However,  some  bands 
may  not  contain  any  pixels  in  the  leftmost  column.  Such 
a  band  is  handled  in  a  similar  manner,  with  the  only  dif¬ 
ference  being  that  its  band.total  variable  is  created  in  the 
leftmost  column  having  a  pixel  in  the  band.  Also,  it  has 
been  assumed  that  all  of  t  he  bands  continue  all  the  way  to 
the  rightmost  column,  although  in  reality  some  do  not  (they 
go  off  the  top  or  bottom  of  the  image).  When  a  band.total 
goes  off  the  top  of  the  image  before  reaching  the  right  hand 
edge,  it  arrives  in  the  bottom  row  of  PEs  because  of  the 
toroidal  connections  between  the  top  and  bottom  rows.  It 
continues  to  follow  the  wave  of  processing  associated  with  its 
projection  angle  9.  but  it  is  no  longer  increased  by  the  col¬ 
umn.contrib  variables  it  encounters.  The  description  given 
above  was  a  simplification  because  it  assumed  that  each  PE 
has  a  single  band.total  variable  at  a  given  time.  In  reality, 
a  PE  can  nave  two  band.total  variables  at  once:  one  that  is 
actively  visiting  PEs  in  its  band  and  one  that  has  completed 
its  calculations  and  has  gone  off  the  top  or  bottom  of  the 
image. 

The  algorithm  detailed  more  fully  in  a  technical  report 

requires  20P  —  43  N  —  4  communication  operations,  and 
a  similar  number  of  iocal  operations,  to  calculate  P  projec¬ 
tions.  While  this  is  fairly  fast,  it  is  possible  to  improve  the 
speed  in  certain  situations.  If  the  same  set  of  projection 
angles  is  used  repeatedly  for  calculating  the  Transform  for 
many  images,  the  values  of  sin($)  and  cos(0)  need  not  be 
shifted  across  the  image,  since  the  path  of  each  band.total 
variable  across  the  image  is  completely  determined  by  the 
values  of  9.  Hence,  the  paths  can  be  pre-computed  and 
sto.ed  in  the  PEs.  Each  PE  can  contain  a  single  bit  for 
each  time  a  "where"  clause  in  the  program  is  executed.  This 
bit  indicates  whether  or  not  the  "where"  clause  is  true  at 
the  given  time.  This  requires  an  additional  P  bytes  of  mem¬ 
ory  per  PE.  When  this  approach  is  used,  only  6P  •+•  1  IN 
-r  4  communication  operations  are  required  to  calculate  P 
projections.  In  addition,  this  approach  requires  fewer  lo¬ 
cal  operations  and  no  multiplications.  Finally,  it  should  be 
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noted  that  in  any  case  the  multiplications  can  be  avoided  if 
the  values  of  cos(0)  and  sin(0)  are  sufficiently  accurate,  since 
the  function  x  cos(01  y  sin(0)  can  then  be  calculated  by- 
using  forward  differencing. 


Five  new  algorithms  for  calculating  Radon  ( Hough)  Trans¬ 
form  on  mesh  arrays  have  been  presented.  The  asymptoti¬ 
cally  fastest  algorithm  previously  published  requires  O(PN’) 
time  to  calculate  P  projections  of  the  Transform,  while  three 
of  the  algorithms  presented  here  require  only  0(P  +  N)  time. 

In  addition  to  the  theoretic  value  of  these  algorithms,  the  au¬ 
thors  hope  that  they  will  prove  fast  enough  to  have  practical 
applications.  Typically  values  of  the  parameters  N  and  P  are 
512  and  180.  respectively,  so  the  algorithms  presented  here 
do  seem  to  offer  a  time  savings  for  realistic  problems. 
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